Integrating plasma cell‐free DNA with clinical laboratory results enhances the prediction of critically ill patients with COVID‐19 at hospital admission

Dear Editor, Owing to the substantial clinical heterogeneity of patients infected with SARS-CoV-2,1,2 factors that rely primarily on clinical and/or laboratory parameters remain inadequate for accurately predicting, at an early stage, which COVID-19 patients will progress to severe or critical illness.3,4 Recent studies have revealed an elevated level of cell-free DNA (cfDNA) in the plasma of severe COVID-19 patients, attributed to massive cell death or irreversible multiorgan injury under pathological conditions.5,6 Therefore, cfDNA profiling may improve COVID-19 prediction and help elucidate the molecular characteristics of this life-threatening disease.7,8 Herein, we developed the M2Model, a LightGBM-based9 machine learning model with focal loss as the objective function, to predict critical COVID-19 at admission by jointly analysing multimodal data, including laboratory parameters and cfDNA profiles.
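To make the focal loss objective mentioned above concrete, the following is a minimal sketch of the binary focal loss and its first derivative with respect to the raw (pre-sigmoid) score. The values of `alpha` and `gamma` are illustrative assumptions, not the settings used for the M2Model, and the function names are hypothetical. A custom objective for a gradient boosting library would additionally need the second derivative, which is omitted here.

```python
import math

def sigmoid(z):
    """Map a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def focal_loss(z, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample: -a_t * (1 - p_t)^gamma * log(p_t).

    alpha and gamma are assumed illustrative values, not taken from the letter.
    """
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
    a_t = alpha if y == 1 else 1.0 - alpha  # class weight
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

def focal_grad(z, y, alpha=0.25, gamma=2.0):
    """Analytic derivative of focal_loss with respect to the raw score z."""
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    sign = 1.0 if y == 1 else -1.0          # d(p_t)/dz changes sign for y == 0
    return sign * a_t * (1.0 - p_t) ** gamma * (
        gamma * p_t * math.log(p_t) + p_t - 1.0
    )
```

With `gamma = 0` the expression reduces to class-weighted cross-entropy; larger `gamma` down-weights samples that are already well classified, which is why focal loss is often chosen for imbalanced cohorts such as 54 critical versus 345 noncritical patients.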
Laboratory results and blood samples were collected from a total of 399 consecutive hospitalized patients with COVID-19 (345 noncritical and 54 critical patients; Table S1). Whole-genome sequencing (WGS) was conducted on plasma cfDNA (Table S2), and we observed a slight shift towards shorter cfDNA fragments in critical patients compared with noncritical patients (Figure S1B). We derived three types of features from the WGS data: fragment length ratio (denoted as FRAGL), transcription start site coverage score (denoted as TSS) and frequency of 4-nucleotide motifs at 5′ fragment ends (denoted as MOTIF). Together with laboratory results (denoted as LAB; Table S3), we obtained four feature-type-specific datasets with a total of 510 features after data preprocessing (Figures 1A and S1A-D).
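As an illustration of how two of the feature types above could be computed, the sketch below derives a short-to-long fragment length ratio and 5′ end 4-mer motif frequencies from a list of fragments. The length windows (100-150 bp and 151-220 bp) and the function names are assumptions made for this example; the letter does not specify the exact definitions used for FRAGL or MOTIF.

```python
from collections import Counter
from itertools import product

def fragment_length_ratio(lengths, short=(100, 150), long=(151, 220)):
    """Ratio of short to long fragments; the windows are assumed, not from the letter."""
    n_short = sum(short[0] <= n <= short[1] for n in lengths)
    n_long = sum(long[0] <= n <= long[1] for n in lengths)
    return n_short / n_long if n_long else float("nan")

def end_motif_frequencies(fragments, k=4):
    """Frequency of each k-mer found at the 5' end of the fragment sequences.

    Motifs containing bases other than A, C, G, T are ignored, and all
    4**k possible motifs are reported so the feature vector has fixed length.
    """
    counts = Counter(seq[:k].upper() for seq in fragments if len(seq) >= k)
    motifs = ["".join(m) for m in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in motifs)
    return {m: (counts[m] / total if total else 0.0) for m in motifs}
```

For `k = 4` this yields 256 motif features per sample, each a fraction of the fragments whose 5′ end starts with that motif.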
By integrating these four datasets, the M2Model was trained and evaluated using 100 random training/testing splits based on the optimal hyperparameters and ranked features (Figure 1B, Table S6). For comparison, we applied the same protocol to each dataset, yielding four additional single-type feature-based models. The top-predictive features were selected at the point where the corresponding model yielded the highest average precision and the lowest focal loss (Figures 1C and S2A-D). Consequently, the M2Model outperformed the single-type feature-based models in discriminating critical from noncritical COVID-19, achieving the highest AUROC (area under the ROC curve) of .955 ± .029 (mean ± SD; Figure 1D) and AUPR (area under the precision-recall curve) of .827 ± .153 (p < .0001; Figure 1E). The Brier score for calibration assessment of the M2Model reached the lowest value of .052 ± .025, suggesting that it best represented the true likelihood of critical COVID-19 (p < .0001; Figure 1F, Table S4).
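For readers less familiar with the metrics reported above, the following sketch shows how the AUROC and the Brier score can be computed for one testing split, and how per-split values can be summarized as mean ± SD. It uses only the Python standard library; the function names are hypothetical and AUPR is omitted for brevity.

```python
from statistics import mean, stdev

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return mean((p - y) ** 2 for y, p in zip(y_true, y_prob))

def auroc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

def summarize(values):
    """Mean and sample standard deviation across repeated splits."""
    return mean(values), stdev(values)
```

Applying `auroc` and `brier_score` to each of the 100 testing splits and passing the resulting lists to `summarize` gives values in the same mean ± SD form as those quoted in the text.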
Although only 21 (4 LAB and 17 TSS) of the 510 combined features (4.12%) were identified as top-predictive features by the M2Model (Figure 2A), they accounted for 37.9% of total feature importance (Figure 2B). Remarkably, TSS features alone contributed the most towards critical COVID-19 prediction (Figure 2C). Visualization of these 21 features showed complex non-linear functions learned by the M2Model (Figures 2D and S4, S5). We also analysed the top features identified by the single-type feature-based models (Figures S6-S9).
Of particular interest were the 17 TSS features above, of which 9 were significantly lower in critical than in noncritical patients (p < .05; Figure 2E), reflecting a marked loss of coverage depth in nucleosome-depleted regions around these TSSs (Figures 2F and S10A). The lower TSS coverage in critical patients could be linked to up-regulated expression of the TSS-associated genes, mainly resulting from the nucleosome occupancy for expressed genes10 (see the Supporting Information). Pathway and functional enrichment analyses of these genes showed significant correlations with immune-related responses (p < .05; Figure 2G). For example, genes with lower values of TSS features in critical patients such as GSDMD, TNFAIP3, DEFA1 and DEFA1B at chr19:50968972 were enriched in the top-ranked pathway of 'NOD-like receptor', where the related proteins interacted closely with each other (Figure S11). Gene set enrichment analysis indicated that many of these TSS-associated genes were significantly related to COVID-19 (Table S5).
We next stratified all patients into three risk strata according to the cut-off values for critical COVID-19 at 98% sensitivity and 98% specificity (Figure 3A). We illustrated that our M2Model could identify, at an early stage, COVID-19 patients at risk of deteriorating towards critical illness. For instance, PU8354 was a noncritical patient at admission but deteriorated during hospitalization. In the M2Model, the 4 laboratory parameters and 17 TSS features contributed 47% and 26.1%, respectively, to the prediction of critical COVID-19 at admission, raising the estimated risk of progression to critical illness from a prior probability of 1.2% to a posterior probability of 74.3% (Figure 3B). Overall survival analysis showed that the high-risk critical COVID-19 patients required a significantly longer hospital stay than the other two risk groups (p < .005; Figure 3C). The univariate Cox proportional hazard analysis with recovery as the end-point showed that the majority of the identified features were significantly associated with a decreased risk of critical COVID-19 (p < .05; Figure 3D). The Spearman correlation analysis also showed strong associations between these features and the three risk strata (Figure 3E). Hierarchical clustering analysis demonstrated that these features yielded a distinct separation among the three risk groups (Figure 3F).
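The three-way stratification described above can be sketched as follows. This is one plausible reading, stated as an assumption: the lower cut-off is the highest probability that still keeps sensitivity at or above 98%, the upper cut-off is the lowest probability that keeps specificity at or above 98%, and patients between the two form the intermediate stratum. The function names are hypothetical, and the sketch assumes the lower cut-off does not exceed the upper one.

```python
import math

def sensitivity_cutoff(y_true, y_prob, target=0.98):
    """Largest cut-off such that at least `target` of critical cases score at or above it."""
    pos = sorted((p for y, p in zip(y_true, y_prob) if y == 1), reverse=True)
    k = math.ceil(target * len(pos) - 1e-9)  # critical cases that must be captured
    return pos[k - 1]

def specificity_cutoff(y_true, y_prob, target=0.98):
    """Smallest cut-off such that at least `target` of noncritical cases score at or below it."""
    neg = sorted(p for y, p in zip(y_true, y_prob) if y == 0)
    k = math.ceil(target * len(neg) - 1e-9)  # noncritical cases that must be excluded
    return neg[k - 1]

def risk_stratum(prob, low_cut, high_cut):
    """Assign a predicted probability to one of three strata (assumes low_cut <= high_cut)."""
    if prob < low_cut:
        return "low"
    if prob > high_cut:
        return "high"
    return "intermediate"
```

Under this reading, few critical patients fall into the low-risk stratum and few noncritical patients fall into the high-risk stratum, while the intermediate stratum collects the uncertain cases.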
In summary, our M2Model achieved superior performance in predicting critical COVID-19 at admission based on a compact subset of integrated laboratory parameters and TSS features. The TSS features, reflecting the open status of chromatin regions, contributed the most to the prediction. The identified features, with their clinical and molecular characteristics, have utility for diagnostics and prognostics, and can serve as markers to monitor the effect of therapeutic interventions on critical COVID-19. Additionally, our approach, as a clinicogenomic framework, can be readily extended to the early prediction of deterioration in patients infected with emerging SARS-CoV-2 variants such as Omicron. We therefore anticipate that our M2Model has the potential to support personalized management of individual patients with COVID-19.

ACKNOWLEDGEMENTS
We would like to thank Dr. Hailin Pan and Jianhui Gong at BGI-Shenzhen for constructive suggestions.

FUNDING INFORMATION
The study was supported by the Guangdong-Hong Kong Joint Laboratory on Immunological and Genetic Kidney Diseases (No. 2019B121205005) and the National Natural Science Foundation of China (Grant nos. 32171441 and 32000398).